data <- read.csv('car93.csv')
pca_cars <- prcomp(data[,-1:-3], scale.=TRUE)
print(summary(pca_cars))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.1704 1.2840 0.97225 0.76321 0.6504 0.5732 0.53562
## Proportion of Variance 0.6701 0.1099 0.06302 0.03883 0.0282 0.0219 0.01913
## Cumulative Proportion 0.6701 0.7800 0.84304 0.88187 0.9101 0.9320 0.95110
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.44946 0.41551 0.34330 0.26369 0.26018 0.20809 0.19842
## Proportion of Variance 0.01347 0.01151 0.00786 0.00464 0.00451 0.00289 0.00262
## Cumulative Proportion 0.96457 0.97608 0.98393 0.98857 0.99308 0.99597 0.99859
## PC15
## Standard deviation 0.14522
## Proportion of Variance 0.00141
## Cumulative Proportion 1.00000
biplot(pca_cars)
print(pca_cars$rotation[,1:2])
## PC1 PC2
## Price 0.2204708 0.37225168
## MPG.city -0.2686351 -0.19307271
## MPG.highway -0.2428963 -0.28759336
## EngineSize 0.2993827 -0.05044005
## Horsepower 0.2504771 0.39222489
## RPM -0.1427112 0.52833268
## Rev.per.mile -0.2626478 0.12094301
## Fuel.tank.capacity 0.2819184 0.16643949
## Length 0.2911181 -0.11364032
## Wheelbase 0.2890445 -0.10287701
## Width 0.2861280 -0.10943488
## Turn.circle 0.2633530 -0.13849282
## Rear.seat.room 0.1806914 -0.27829245
## Luggage.room 0.2299132 -0.34560414
## Weight 0.3065978 0.10976982
PC1: In the first principal component, the majority of the coefficients are positive; the four negative ones (MPG.city, MPG.highway, RPM, Rev.per.mile) are all rates of some kind, such as miles per gallon in the city or on the highway, and revolutions per minute or per mile. The largest coefficient in magnitude is Weight, while the smallest is RPM.
PC2: In the second principal component, the coefficients vary more than in the first, with the largest coefficient in magnitude being RPM and the smallest being EngineSize. More coefficients are close to 0, indicating that much of the variation in those predictors was already captured by the first component. In addition, the negative coefficients in the second component all relate to distance or size, though the grouping is not as clean as in the first component, which also suggests the variation being captured here is lower.
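The claims above about the largest- and smallest-magnitude loadings can be checked directly by sorting the rotation matrix. A self-contained sketch using the built-in mtcars data as a stand-in for the car93 data (the same calls apply to `pca_cars$rotation`):

```r
# mtcars stands in for the car93 data so this snippet runs on its own
pca_demo <- prcomp(mtcars, scale. = TRUE)

# Rank PC1 loadings by absolute magnitude: the head gives the most
# influential variables, the tail the least influential
ranked <- sort(abs(pca_demo$rotation[, 1]), decreasing = TRUE)
head(ranked, 3)
tail(ranked, 3)
```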
print(summary(pca_cars))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.1704 1.2840 0.97225 0.76321 0.6504 0.5732 0.53562
## Proportion of Variance 0.6701 0.1099 0.06302 0.03883 0.0282 0.0219 0.01913
## Cumulative Proportion 0.6701 0.7800 0.84304 0.88187 0.9101 0.9320 0.95110
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.44946 0.41551 0.34330 0.26369 0.26018 0.20809 0.19842
## Proportion of Variance 0.01347 0.01151 0.00786 0.00464 0.00451 0.00289 0.00262
## Cumulative Proportion 0.96457 0.97608 0.98393 0.98857 0.99308 0.99597 0.99859
## PC15
## Standard deviation 0.14522
## Proportion of Variance 0.00141
## Cumulative Proportion 1.00000
plot(pca_cars, type="lines")
I. Kaiser Criterion: For scaled PCA, keep every principal component whose variance (equivalently, whose standard deviation) is at least 1. By this rule, the first 2 principal components should be kept.
II. Cumulative Proportion: The first 4 principal components capture only about 88% of the variation, so to retain at least 90%, the first 5 principal components must be kept.
III. Scree Plot: The scree plot plateaus around the 3rd component: component 3 is not much different from 4, and 4 is not much different from 5. Based on that elbow, the first 2 principal components should be kept.
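The three selection rules above can also be computed programmatically from the component variances. A self-contained sketch (mtcars again stands in for the car93 data; run on `pca_cars`, the same code reproduces the counts quoted above):

```r
pca_demo <- prcomp(mtcars, scale. = TRUE)
vars <- pca_demo$sdev^2  # component variances (eigenvalues)

# I. Kaiser: number of components with variance >= 1
n_kaiser <- sum(vars >= 1)

# II. Cumulative proportion: first k reaching 90% of total variance
n_cumprop <- which(cumsum(vars) / sum(vars) >= 0.90)[1]

# III. Scree plot: look for the elbow by eye
plot(vars, type = "b", xlab = "Component", ylab = "Variance")

c(kaiser = n_kaiser, cumulative_90 = n_cumprop)
```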
library(MASS)  # for lda()

predictors_pca <- as.data.frame(pca_cars$x[, 1:2])
response <- ifelse(data$Type == "Small", "Small", "Not Small")
lda_cars <- lda(response ~ ., data = data.frame(response = response, predictors_pca), CV = TRUE)
# One-hot encode the true class so that y * log(posterior) keeps only the
# log-posterior of each observation's true class (the nonzero entries)
y_1 <- ifelse(response == 'Not Small', 1, 0)
y_2 <- ifelse(response == 'Small', 1, 0)
y <- cbind(y_1, y_2)
log_product <- y * log(lda_cars$posterior)
log_loss <- -mean(log_product[log_product != 0])
print(log_loss)
## [1] 0.2216978
predictors_pca <- as.data.frame(pca_cars$x[, 1:2])
response <- data$Type
lda_cars <- lda(response ~ ., data = data.frame(response = response, predictors_pca), CV = TRUE)
y_1 <- ifelse(response == 'Compact', 1, 0)
y_2 <- ifelse(response == 'Large', 1, 0)
y_3 <- ifelse(response == 'Midsize', 1, 0)
y_4 <- ifelse(response == 'Small', 1, 0)
y_5 <- ifelse(response == 'Sporty', 1, 0)
y <- cbind(y_1, y_2, y_3, y_4, y_5)
log_product <- y * log(lda_cars$posterior)
log_loss <- -mean(log_product[log_product != 0])
print(log_loss)
## [1] 0.8125036
Based on the above, the two-category version of the response ("Small" vs. "Not Small") has a smaller log-loss than the version with all vehicle types. It is reasonable to expect that classifying among more categories would require a better model or more understanding of the groups in the data. Here, however, we used the same model and the same number of principal components (i.e., the same amount of captured variation), and the five-class fit performed worse because log-loss heavily penalizes confident misclassifications.
This matches our discussion, in which it was concluded that there is no guarantee that higher-variance components lead to more predictive power for y. We also do not know in advance which principal components explain the variation in y, so even though we used only the first 2 principal components, we might achieve a lower log-loss using other components that capture less variance.
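The indicator-matrix trick used above works, but a cleaner equivalent picks each observation's true-class posterior directly with two-column matrix indexing. A self-contained sketch using leave-one-out LDA on the built-in iris data (standing in for the car data; `fit`, `p_true`, and `col_idx` are illustrative names, not from the assignment):

```r
library(MASS)  # for lda()

# Leave-one-out cross-validated LDA (CV = TRUE), as in the analysis above
fit <- lda(Species ~ ., data = iris, CV = TRUE)

# For each row, grab the posterior probability of the true class via
# two-column matrix indexing, then average the negative logs
n <- nrow(iris)
col_idx <- match(iris$Species, colnames(fit$posterior))
p_true <- fit$posterior[cbind(seq_len(n), col_idx)]
log_loss <- -mean(log(p_true))
print(log_loss)
```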
library(mclust)  # provides the banknote data and adjustedRandIndex()

data <- banknote
print(summary(data[,-1:-2]))
## Left Right Bottom Top
## Min. :129.0 Min. :129.0 Min. : 7.200 Min. : 7.70
## 1st Qu.:129.9 1st Qu.:129.7 1st Qu.: 8.200 1st Qu.:10.10
## Median :130.2 Median :130.0 Median : 9.100 Median :10.60
## Mean :130.1 Mean :130.0 Mean : 9.418 Mean :10.65
## 3rd Qu.:130.4 3rd Qu.:130.2 3rd Qu.:10.600 3rd Qu.:11.20
## Max. :131.0 Max. :131.1 Max. :12.700 Max. :12.30
## Diagonal
## Min. :137.8
## 1st Qu.:139.5
## Median :140.4
## Mean :140.5
## 3rd Qu.:141.5
## Max. :142.4
print(cor(data[,-1:-2]))
## Left Right Bottom Top Diagonal
## Left 1.0000000 0.7432628 0.4137810 0.3623496 -0.5032290
## Right 0.7432628 1.0000000 0.4867577 0.4006702 -0.5164755
## Bottom 0.4137810 0.4867577 1.0000000 0.1418513 -0.6229827
## Top 0.3623496 0.4006702 0.1418513 1.0000000 -0.5940446
## Diagonal -0.5032290 -0.5164755 -0.6229827 -0.5940446 1.0000000
Based on the summary of the data, it must be scaled: Bottom and Top are an order of magnitude smaller than the other predictors.
Scaled Euclidean distance seems appropriate due to its simplicity. Scaled Manhattan distance was appealing because the measurements are the edges of the banknote (left, right, top, and bottom), but it is not clear that the "travel along the axes" interpretation applies to these data. Mahalanobis distance was considered because of the correlation between the Left and Right predictors, but was dropped because the correlations among the other predictors, particularly Bottom and Top, are not as high.
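For reference, all three candidate distances are easy to compute in base R: dist() handles Euclidean and Manhattan, and mahalanobis() gives each observation's (squared) distance to the sample centre while accounting for correlation between predictors. A self-contained sketch on scaled mtcars, standing in for the banknote measurements:

```r
x <- scale(mtcars)  # scale first, as argued above

# Pairwise scaled Euclidean and Manhattan distances
d_euc <- dist(x, method = "euclidean")
d_man <- dist(x, method = "manhattan")

# Squared Mahalanobis distance of each row to the sample mean;
# the covariance term downweights correlated predictors
d_mah <- mahalanobis(x, center = colMeans(x), cov = cov(x))
```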
scaled_data <- scale(data[,-1:-2])
dist_matrix <- dist(scaled_data, method = "euclidean")
hc_single <- hclust(dist_matrix, method = "single")
hc_complete <- hclust(dist_matrix, method = "complete")
hc_avg <- hclust(dist_matrix, method = "average")
plot(hc_single, hang = -1, labels = rownames(data))
plot(hc_complete, hang = -1, labels = rownames(data))
plot(hc_avg, hang = -1, labels = rownames(data))
The dendrogram from the "complete" linkage method looks the best, and the one from "average" linkage looks somewhat acceptable. The "single" linkage dendrogram looks strange; a chaining pattern appears to be occurring. The "complete" dendrogram indicates 2 groups and the "average" dendrogram suggests 3, but I would ultimately go with "complete" linkage, since its dendrogram looked the best, and conclude there are likely 2 groups in the data.
clusters <- cutree(hc_complete, k=2)
conf_matrix <- table(data$Status, clusters)
print(conf_matrix)
## clusters
## 1 2
## counterfeit 100 0
## genuine 30 70
misclassification_rate <- 1 - sum(diag(conf_matrix)) / sum(conf_matrix)
print(misclassification_rate)
## [1] 0.15
set.seed(632)
k_means <- kmeans(scaled_data, 2, nstart=25)
conf_matrix <- table(data$Status, k_means$cluster)
print(conf_matrix)
##
## 1 2
## counterfeit 100 0
## genuine 8 92
misclassification_rate <- 1 - sum(diag(conf_matrix)) / sum(conf_matrix)
print(misclassification_rate)
## [1] 0.04
set.seed(632)
k_means <- kmeans(data[,-1:-2], 2, nstart=25)
conf_matrix <- table(data$Status, k_means$cluster)
print(conf_matrix)
##
## 1 2
## counterfeit 0 100
## genuine 100 0
# k-means labels are arbitrary; here the clusters came out flipped, so
# the correct matches lie on the opposite (anti-)diagonal
n <- nrow(conf_matrix)
opposite_diag_values <- conf_matrix[cbind(n:1, 1:n)]
misclassification_rate <- 1 - sum(opposite_diag_values) / sum(conf_matrix)
print(misclassification_rate)
## [1] 0
A potential reason the unscaled data performs better than the scaled version is that the raw distances between data points on particular predictors may matter for clustering. For example, differences along Left/Right may be more important than differences along Top/Bottom; these sit on very different scales originally, but standardization flattens that contrast and makes it less prominent.
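Since cluster labels are arbitrary, the diagonal of a confusion table only lines up with the true classes by chance; a small helper (hypothetical, not part of the assignment code) that takes the better of the two possible 2-cluster label assignments removes the need for the manual opposite-diagonal fix used above:

```r
# Misclassification rate for a 2x2 cluster-vs-class table, minimized
# over the two possible cluster label assignments
rate <- function(tab) 1 - sum(diag(tab)) / sum(tab)
robust_rate <- function(tab) min(rate(tab), rate(tab[, 2:1]))

# Example: perfectly separated clusters whose labels came out flipped
tab <- matrix(c(0, 100, 100, 0), nrow = 2)
robust_rate(tab)  # 0 after flipping the columns
```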
The strong performance of unsupervised techniques on the banknote data set indicates there is an underlying structure in the data that these techniques could exploit. It also suggests there are features useful for anomaly detection, i.e., for determining whether particular banknotes deviate from the norm. In general, such strong performance signals a relationship worth further exploration and analysis.
load("lots.Rdata")
data <- data.frame(datmat)
data$clusts <- as.factor(clusts)
library(ggplot2)

plot <- ggplot(data, aes(x=X1, y=X2, color=clusts)) +
geom_point()
print(plot)
set.seed(1026)
scaled_data <- scale(data[,-3])
k_means <- kmeans(scaled_data, 20)
adj_rand_index <- adjustedRandIndex(as.numeric(data$clusts), as.numeric(k_means$cluster))
print(adj_rand_index)
## [1] 0.7748961
set.seed(6201)
scaled_data <- scale(data[,-3])
k_means <- kmeans(scaled_data, 20)
adj_rand_index <- adjustedRandIndex(as.numeric(data$clusts), as.numeric(k_means$cluster))
print(adj_rand_index)
## [1] 0.8971005
set.seed(1026)
scaled_data <- scale(data[,-3])
k_means <- kmeans(scaled_data, 20, nstart=1000)
adj_rand_index <- adjustedRandIndex(as.numeric(data$clusts), as.numeric(k_means$cluster))
print(adj_rand_index)
## [1] 0.9567124
set.seed(6201)
scaled_data <- scale(data[,-3])
k_means <- kmeans(scaled_data, 20, nstart=1000)
adj_rand_index <- adjustedRandIndex(as.numeric(data$clusts), as.numeric(k_means$cluster))
print(adj_rand_index)
## [1] 0.9641084
Based on the results from B and C, the adjusted Rand indices are not very close, which indicates the random start plays an important role in the clusters formed by the k-means algorithm. More importantly, it raises the question of whether our chosen k of 20 is actually correct, since the solutions differ from the original labels.
Moving to the results from parts D and E, the adjusted Rand indices for both are very close, indicating the initial start matters much less once nstart is large: we repeatedly reach similar groupings. This also reinforces that 20 groups is a valid choice and that the clusters formed via k-means closely match the original groups.
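The role of nstart can be made concrete by comparing the achieved objective: kmeans() returns the run with the smallest total within-cluster sum of squares, so more random starts can only help it escape poor local optima. A self-contained sketch on the scaled iris measurements, standing in for the lots data:

```r
set.seed(1)
x <- scale(iris[, 1:4])

# One random start vs. many: each call reports the best solution it
# found, measured by tot.withinss (smaller is better)
fit_1 <- kmeans(x, centers = 10, nstart = 1)
fit_many <- kmeans(x, centers = 10, nstart = 100)
c(one_start = fit_1$tot.withinss, many_starts = fit_many$tot.withinss)
```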
load('bsim.Rdata')
data <- data.frame(asim)
# initial data exploration
plot(data)
print(cor(data[,-1]))
## V2 V3 V4 V5 V6 V7
## V2 1.00000000 -0.7176512 -0.02516521 0.7073009 0.22288149 0.8288672
## V3 -0.71765119 1.0000000 -0.15587899 -0.6831011 -0.22714898 -0.7976255
## V4 -0.02516521 -0.1558790 1.00000000 0.1312994 -0.07314724 0.1521966
## V5 0.70730092 -0.6831011 0.13129941 1.0000000 0.24767181 0.8086329
## V6 0.22288149 -0.2271490 -0.07314724 0.2476718 1.00000000 0.2618375
## V7 0.82886719 -0.7976255 0.15219662 0.8086329 0.26183747 1.0000000
## V8 0.76762981 -0.6740252 0.04290202 0.6894926 0.20735075 0.7853305
## V9 -0.69315038 0.7150549 -0.17721595 -0.7314435 -0.28115894 -0.8374997
## V10 -0.47654425 0.4520752 -0.21329907 -0.4296194 -0.19552181 -0.5146011
## V8 V9 V10
## V2 0.76762981 -0.6931504 -0.4765443
## V3 -0.67402515 0.7150549 0.4520752
## V4 0.04290202 -0.1772159 -0.2132991
## V5 0.68949263 -0.7314435 -0.4296194
## V6 0.20735075 -0.2811589 -0.1955218
## V7 0.78533051 -0.8374997 -0.5146011
## V8 1.00000000 -0.6662969 -0.4336191
## V9 -0.66629686 1.0000000 0.4767139
## V10 -0.43361914 0.4767139 1.0000000
print(cov(data[,-1]))
## V2 V3 V4 V5 V6 V7
## V2 15.5514796 -11.871511 -0.2585403 15.149309 1.9420930 22.744933
## V3 -11.8715114 17.595982 -1.7034767 -15.563041 -2.1053666 -23.281964
## V4 -0.2585403 -1.703477 6.7870924 1.857836 -0.4210659 2.759058
## V5 15.1493088 -15.563041 1.8578361 29.498877 2.9722784 30.561033
## V6 1.9420930 -2.105367 -0.4210659 2.972278 4.8822539 4.025834
## V7 22.7449333 -23.281964 2.7590575 30.561033 4.0258338 48.420327
## V8 7.0193204 -6.556019 0.2591654 8.683399 1.0623636 12.671374
## V9 -11.3442795 12.448291 -1.9160564 -16.487208 -2.5782539 -24.185891
## V10 -4.8186775 4.862461 -1.4248515 -5.983088 -1.1077561 -9.181704
## V8 V9 V10
## V2 7.0193204 -11.344280 -4.818677
## V3 -6.5560195 12.448291 4.862461
## V4 0.2591654 -1.916056 -1.424852
## V5 8.6833988 -16.487208 -5.983088
## V6 1.0623636 -2.578254 -1.107756
## V7 12.6713740 -24.185891 -9.181704
## V8 5.3766870 -6.411930 -2.578129
## V9 -6.4119296 17.223729 5.072946
## V10 -2.5781287 5.072946 6.574724
# initial linear model
initial_linear <- lm(y ~. , data=data)
print(summary(initial_linear))
##
## Call:
## lm(formula = y ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.8122 -4.4968 0.1784 4.4572 27.3810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.2355 5.2878 -7.042 6.45e-12 ***
## V2 2.3483 0.1704 13.785 < 2e-16 ***
## V3 1.5074 0.1369 11.009 < 2e-16 ***
## V4 -0.5731 0.1406 -4.076 5.34e-05 ***
## V5 -2.2136 0.1080 -20.491 < 2e-16 ***
## V6 1.8948 0.1603 11.822 < 2e-16 ***
## V7 1.6712 0.1334 12.526 < 2e-16 ***
## V8 -1.7447 0.2503 -6.971 1.02e-11 ***
## V9 -0.3703 0.1528 -2.423 0.0158 *
## V10 3.2300 0.1573 20.530 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.473 on 490 degrees of freedom
## Multiple R-squared: 0.74, Adjusted R-squared: 0.7352
## F-statistic: 154.9 on 9 and 490 DF, p-value: < 2.2e-16
# PCA
pca_data <- prcomp(data[,-1], scale.=TRUE)
biplot(pca_data)
print(summary(pca_data))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.2679 1.0543 0.94114 0.79362 0.57914 0.56492 0.50627
## Proportion of Variance 0.5715 0.1235 0.09842 0.06998 0.03727 0.03546 0.02848
## Cumulative Proportion 0.5715 0.6950 0.79340 0.86338 0.90065 0.93611 0.96458
## PC8 PC9
## Standard deviation 0.46400 0.32163
## Proportion of Variance 0.02392 0.01149
## Cumulative Proportion 0.98851 1.00000
pca_predictors <- as.data.frame(pca_data$x[, 1:2])
pca_predictors <- data.frame(data[,1], pca_predictors)
colnames(pca_predictors)[colnames(pca_predictors) == 'data...1.'] <- 'y'
linear <- lm(y ~., data=pca_predictors)
summary(linear)
##
## Call:
## lm(formula = y ~ ., data = pca_predictors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.466 -8.128 0.716 7.981 45.308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.7747 0.5733 78.103 < 2e-16 ***
## PC1 1.1609 0.2530 4.588 5.68e-06 ***
## PC2 -6.0199 0.5443 -11.060 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.82 on 497 degrees of freedom
## Multiple R-squared: 0.2239, Adjusted R-squared: 0.2208
## F-statistic: 71.68 on 2 and 497 DF, p-value: < 2.2e-16
# Clustering
scaled_data <- data.frame(scale(data[,-1], scale=TRUE))
# Decide to go with 2 groups based on biplot from PCA
k_means <- kmeans(scaled_data, 2, nstart=1000)
temp_data <- data
temp_data$cluster <- k_means$cluster
linear_clusters <- lm(y ~ cluster*(V2+V3+V4+V5+V6+V7+V8+V9+V10) , data=temp_data)
print(summary(linear_clusters))
##
## Call:
## lm(formula = y ~ cluster * (V2 + V3 + V4 + V5 + V6 + V7 + V8 +
## V9 + V10), data = temp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.99101 -0.59457 0.03553 0.61065 2.98516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.31013 2.70073 -1.966 0.0499 *
## cluster 7.56585 1.76602 4.284 2.22e-05 ***
## V2 7.62656 0.08841 86.261 < 2e-16 ***
## V3 -1.72147 0.05798 -29.689 < 2e-16 ***
## V4 -6.59587 0.06230 -105.868 < 2e-16 ***
## V5 -2.07703 0.04572 -45.433 < 2e-16 ***
## V6 2.08090 0.06557 31.733 < 2e-16 ***
## V7 0.11227 0.07939 1.414 0.1580
## V8 -1.23255 0.10343 -11.917 < 2e-16 ***
## V9 -7.15220 0.08045 -88.907 < 2e-16 ***
## V10 2.59536 0.06502 39.914 < 2e-16 ***
## cluster:V2 -5.11942 0.07260 -70.516 < 2e-16 ***
## cluster:V3 1.90348 0.04313 44.138 < 2e-16 ***
## cluster:V4 4.27634 0.04067 105.159 < 2e-16 ***
## cluster:V5 -0.02066 0.03426 -0.603 0.5468
## cluster:V6 -0.43149 0.04341 -9.940 < 2e-16 ***
## cluster:V7 0.89608 0.05052 17.737 < 2e-16 ***
## cluster:V8 -0.73147 0.07931 -9.223 < 2e-16 ***
## cluster:V9 4.95254 0.05917 83.704 < 2e-16 ***
## cluster:V10 -0.23190 0.04657 -4.979 8.92e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9774 on 480 degrees of freedom
## Multiple R-squared: 0.9956, Adjusted R-squared: 0.9955
## F-statistic: 5773 on 19 and 480 DF, p-value: < 2.2e-16
The initial linear model has an R-squared of about 0.74, well below the 0.99 target. Performing PCA and principal-component regression, the PCReg model does much worse, with an R-squared of 0.22; however, the biplot clearly shows 2 groups.
After performing k-means clustering with 2 groups, I re-fit the linear model, this time interacting the predictors with the cluster label, resulting in a much better fit and achieving the 0.99 R-squared target.
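One caveat about the final model: cluster enters the formula as a numeric 1/2 code. With two clusters this is just a reparametrization of the per-cluster intercepts and slopes, but coding it as a factor makes those per-cluster effects explicit and generalizes to more than two clusters. A self-contained sketch on synthetic data (all variable names here are illustrative):

```r
# Synthetic stand-in: two groups with clearly different slopes
set.seed(1)
n <- 200
cl <- rep(1:2, each = n / 2)
x <- rnorm(n)
y <- ifelse(cl == 1, 2 * x, -3 * x) + rnorm(n, sd = 0.1)
d <- data.frame(y = y, x = x, cluster = factor(cl))

# factor(cluster) * x fits a separate intercept and slope per cluster
fit <- lm(y ~ cluster * x, data = d)
summary(fit)$r.squared
```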